This is the tutorial for the R primer workshop. I will go through with you some R basics and good practices for data exploration, data visualization, and spatial data manipulation using R.

We will enjoy three sections (use the left panel to navigate)

— R basics and beyond

— Data visualization with ggplot2

— Geospatial data analysis

Notice

  • When you are running and modifying the code yourself, use the workshop.R file

Load the packages

# this section will automatically install the missing packages on your own computer

# list the packages that we need
packages_need <- c("data.table", # read in csv file and doing data analysis
                   "raster", # analyze raster data
                   "sf", # analyze vector data
                   "ggplot2", # data visualization
                   "ggspatial", # plot raster map
                   "RColorBrewer", # give me some nice color palettes
                   "scales",
                   "cowplot",
                   "shiny", # interactive data visualization
                   "shinyjs", # interactive data visualization
                   "leaflet" # interactive data visualization

)


# What the code below does is to check if your R has these packages installed (the require() function), if not, install them (the install.packages() function), and then load the packages using the library() function.

for (package in packages_need){
  if (!require(package, character.only = TRUE)){
    install.packages(package)
  }
  library(package, character.only = TRUE)
}

1 R basics and beyond

1.1 Style guide

Good style is important because while your code only has one author, it’ll usually have multiple readers. This is especially true when you’re writing code with others. Learn more about the good style, click Here

1.1.1 File names

File names should be meaningful and end in .R.

# Good
fit-models.R
utility-functions.R

# Bad
foo.r
stuff.r

If files need to be run in sequence, prefix them with numbers.

0-download.R
1-parse.R
2-explore.R

1.1.2 Variable names

Name Variable in R

  • The variable name generally starts with letter and can contain number, letter, underscore(_) and period(.). Example: day_one, day.one

  • Reserved words or keywords (such as TRUE, FALSE, function, if, else, break,for etc.) are not allowed

  • Special characters such as ‘#’, ‘&’, etc., along with White space (tabs, space) are not allowed

  • One more note about variables: R is a case-sensitive language. So, variable x is not the same as X

Variable Assignment

# Good (shortcut for assignment operator: Option + Minus on MAC and Alt + Minus on Windows & Linux)
x <- 5

# Bad
x = 5

# Good
sample(x = 1:10, size = 5)

To learn more about the differences between <- and =, click Here

1.1.3 Some naming tips

Strive for names that are concise and meaningful (this is not easy!).

# Good
day_one
day_1

# Bad
first_day_of_the_month
dayone
djm1

Where possible, avoid using names of existing functions and variables. Doing so will cause confusion for the readers of your code.

# Bad
T <- FALSE
c <- 10
mean <- function(x) sum(x)

1.2 Arithmetic with R

When using R as a calculator, the order of operations is the same as you would have learned back in school. From highest to lowest precedence:

  1. Parentheses: (, )
  2. Exponentiation and root extraction: ^ or **
  3. multiplication and division: * and /
  4. addition and subtraction: + and -
# First, let's set up our working directory

# An addition (short cut for running selected code: Cmd + Return on Mac and Ctrl + Enter on Windows & Linux)
5 + 5
## [1] 10
# A subtraction
5 - 5 
## [1] 0
# A multiplication
3 * 5
## [1] 15
# A division
9 / 2
## [1] 4.5
# integer division
9 %/% 2
## [1] 4
# x mod y (“x modulo y”)
9 %% 2
## [1] 1
# An Exponentiation
3^2
## [1] 9
# A root extraction
4^(0.5)
## [1] 2
# make if fancy
3 * (4/5 + 2^2)
## [1] 14.4
# assign the value to a variable
x <- 5

1.3 Data types in R

R works with many data types. Some of the most basic types to get started are:

  • Decimal values (or double precision floating point numbers) like 4.5 are called numeric
  • Whole numbers like 4 are called integers. Integers are also numeric
  • Boolean values (TRUE or FALSE) are called logical
  • Text (or string) values are called character
# Declare variables of different types
my_integer <- 1
my_decimal <- 1.5
my_character <- "REU student"
my_logical <- TRUE

# Check class of my_integer and my_decimal
class(my_integer)
## [1] "numeric"
class(my_decimal)
## [1] "numeric"
class(my_character)
## [1] "character"
class(my_logical)
## [1] "logical"
# OK, how about some real integer types
# The advantage in using real integer types is that it can save you some storage memory in your computer
# But in most cases, you don't need to care!
my_real_integer <- 1L
class(my_real_integer)
## [1] "integer"

1.4 Objects in R

  • Vector — one-dimension array that can hold numeric data, character data, or logical data
  • Factor — special vector that represents categorical data. Factors can be ordered or unordered
  • Matrix — two-dimension array
  • data.frame — two-dimension objects. It has the variables of a data set as columns and the observations as rows.
  • List — perhaps the most flexible object in R.

Notice

  • For Vector, factor and matrix, each variable can only contain the same data type!
  • For data.frame, within a column all elements have the same data type, but different columns can be of different data types
  • For list, it can be seen as a collection of elements without any restriction on the data types, length or structure of each element

1.4.1 Vector

# Shortcut for inserting pipe operator: Cmd + Shift + M on MAC and Ctrl + Shift + M on Windows & Linux)

# vector
# "c" means concatenate
v_num <- c(1, 2, 3, 4, 5)
v_char <- c("a", "b", "c")
v_logical <- c(TRUE, FALSE, FALSE)

# test different data types in a vector
# 1 and 2 will be automatically converted to character data type
v_test <- c(1, 2, "a")

# check the length of the vector
length(v_num)
## [1] 5
# select some values from the vector
v_num[1]
## [1] 1
v_num[1:3]
## [1] 1 2 3
v_num[c(1, 5)]
## [1] 1 5
v_num[-1]
## [1] 2 3 4 5
v_num > 2
## [1] FALSE FALSE  TRUE  TRUE  TRUE
v_num[v_num > 2]
## [1] 3 4 5
# give some names to the vector
names(v_num) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# select based on name
v_num["Monday"]
## Monday 
##      1
# sum up two vectors
v_a <- c(1, 2, 3, 4, 5)

v_b <- c(2, 4, 6, 8, 10)

v_total <- v_a + v_b

# concatenate two vectors
v_join <- c(v_a, v_b)

# do some basic statistics on vector

# mean value
mean(v_a)
## [1] 3
# standard deviation
sd(v_a)
## [1] 1.581139
# quickly create a vector

# create a vector sequence
1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
# create a vector sequence with 2 as the step
seq(1, 10, by = 2)
## [1] 1 3 5 7 9
# create a vector with 5 replicates
rep(2, times = 5)
## [1] 2 2 2 2 2
# create a vector with lower-case or captial letters
letters[1:10]
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
LETTERS[1:10]
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
# create a long vector
v_long <- 1:1000

# use head and tail to show the first or last several elements in the vector
head(v_long)
## [1] 1 2 3 4 5 6
tail(v_long)
## [1]  995  996  997  998  999 1000

1.4.2 Factor

# Shortcut for inserting pipe operator: Cmd + Shift + M on MAC and Ctrl + Shift + M on Windows & Linux)

# unordered (nominal categorical variable)
v_fruit <- c("apple", "banana", "orange", "peach")

f_fruit <- factor(v_fruit)

f_fruit[1] > f_fruit[2]
## Warning in Ops.factor(f_fruit[1], f_fruit[2]): '>' not meaningful for factors
## [1] NA
# ordered (ordinal categorical variable)
v_temp <- c("High", "Low", "High","Low", "Medium")

f_temp <- factor(v_temp)

f_temp <- factor(v_temp, levels = c("Low", "Medium", "High"), ordered = TRUE)

f_temp[1] > f_temp[2]
## [1] TRUE

1.4.3 Matrix

v_a <- 1:6

# create a 2*3 matrix
m_a <- matrix(v_a, nrow = 2, ncol = 3)

m_a
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
matrix(v_a, nrow = 2)
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
# force it to expand along the row
matrix(v_a, nrow = 2, byrow = TRUE)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
# check the total dimension
dim(m_a)
## [1] 2 3
# check how many rows
nrow(m_a)
## [1] 2
# check how many columns
ncol(m_a)
## [1] 3
# select specific row
m_a[1, ]
## [1] 1 3 5
# select specific columns
m_a[, 1]
## [1] 1 2
# let's display m_a again
m_a
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
# calculate the totals for each row of a matrix
rowSums(m_a)
## [1]  9 12
# calculate the totals for each column of a matrix
colSums(m_a)
## [1]  3  7 11
# calculate the mean for each row of a matrix
rowMeans(m_a)
## [1] 3 4
# calculate the sd for each row or each column
# unfortunately, we don't have a built-in rowSds...

# we use the apply function
# Margin = 1 means along the row (row-wise)
# Margin = 1 2 means along the column (column_wise)

apply(m_a, MARGIN = 1, FUN = sd)
## [1] 2 2
apply(m_a, MARGIN = 2, FUN = sd)
## [1] 0.7071068 0.7071068 0.7071068
# check if the apply is right
sd(m_a[1, ])
## [1] 2
sd(m_a[, 1])
## [1] 0.7071068

1.4.4 Data.frame

We use the mpg data set that is shipped with the ggplot2 package (we already loaded ggplot2 so we can directly access the data). mpgprovides fuel economy data from 1999 and 2008 for 38 popular models of cars.



1.4.4.1 Basic structure

dim(mpg) # dimension
## [1] 234  11
# as you can see, each column in data.frame has the same data type, while different columns can have different data types
str(mpg) # data structure
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...
summary(mpg) # data statistics
##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00
head(mpg) # first several rows
## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…
tail(mpg) # last several rows
## # A tibble: 6 x 11
##   manufacturer model  displ  year   cyl trans     drv     cty   hwy fl    class 
##   <chr>        <chr>  <dbl> <int> <int> <chr>     <chr> <int> <int> <chr> <chr> 
## 1 volkswagen   passat   1.8  1999     4 auto(l5)  f        18    29 p     midsi…
## 2 volkswagen   passat   2    2008     4 auto(s6)  f        19    28 p     midsi…
## 3 volkswagen   passat   2    2008     4 manual(m… f        21    29 p     midsi…
## 4 volkswagen   passat   2.8  1999     6 auto(l5)  f        16    26 p     midsi…
## 5 volkswagen   passat   2.8  1999     6 manual(m… f        18    26 p     midsi…
## 6 volkswagen   passat   3.6  2008     6 auto(s6)  f        17    26 p     midsi…

1.4.4.2 Learn data.table

# we convert the data.frame to data.table (a special type of data.frame that we will use for all the data manipulation)
dt_mpg <- as.data.table(mpg)

# select elements of data.frame by position
dt_mpg[1, ]
##    manufacturer model displ year cyl    trans drv cty hwy fl   class
## 1:         audi    a4   1.8 1999   4 auto(l5)   f  18  29  p compact
dt_mpg[, 1]
##      manufacturer
##   1:         audi
##   2:         audi
##   3:         audi
##   4:         audi
##   5:         audi
##  ---             
## 230:   volkswagen
## 231:   volkswagen
## 232:   volkswagen
## 233:   volkswagen
## 234:   volkswagen
dt_mpg[1:2, ]
##    manufacturer model displ year cyl      trans drv cty hwy fl   class
## 1:         audi    a4   1.8 1999   4   auto(l5)   f  18  29  p compact
## 2:         audi    a4   1.8 1999   4 manual(m5)   f  21  29  p compact
dt_mpg[1:2, c(2,3)]
##    model displ
## 1:    a4   1.8
## 2:    a4   1.8
# extract 1 column as a vector
head(dt_mpg[[1]])
## [1] "audi" "audi" "audi" "audi" "audi" "audi"
# select elements of data.frame by name
# the .() operator keeps the output as data.table
dt_mpg[1:5, .(year)]
##    year
## 1: 1999
## 2: 1999
## 3: 2008
## 4: 2008
## 5: 1999
dt_mpg[1:5, .(model, year)]
##    model year
## 1:    a4 1999
## 2:    a4 1999
## 3:    a4 2008
## 4:    a4 2008
## 5:    a4 1999
dt_mpg[1:5, .(model, year)]
##    model year
## 1:    a4 1999
## 2:    a4 1999
## 3:    a4 2008
## 4:    a4 2008
## 5:    a4 1999
# extract 1 column as a vector
dt_mpg[1:5, year]
## [1] 1999 1999 2008 2008 1999
# or you can use the dollar sign
head(dt_mpg$year)
## [1] 1999 1999 2008 2008 1999 1999
# select elements of data.frame by logical vector
head(dt_mpg$year > 2005)
## [1] FALSE FALSE  TRUE  TRUE FALSE FALSE
head(dt_mpg[dt_mpg$year > 2005, ])
##    manufacturer      model displ year cyl      trans drv cty hwy fl   class
## 1:         audi         a4   2.0 2008   4 manual(m6)   f  20  31  p compact
## 2:         audi         a4   2.0 2008   4   auto(av)   f  21  30  p compact
## 3:         audi         a4   3.1 2008   6   auto(av)   f  18  27  p compact
## 4:         audi a4 quattro   2.0 2008   4 manual(m6)   4  20  28  p compact
## 5:         audi a4 quattro   2.0 2008   4   auto(s6)   4  19  27  p compact
## 6:         audi a4 quattro   3.1 2008   6   auto(s6)   4  17  25  p compact
# since we are using data.table, why not use the tidy way?
head(dt_mpg[year > 2005, ])
##    manufacturer      model displ year cyl      trans drv cty hwy fl   class
## 1:         audi         a4   2.0 2008   4 manual(m6)   f  20  31  p compact
## 2:         audi         a4   2.0 2008   4   auto(av)   f  21  30  p compact
## 3:         audi         a4   3.1 2008   6   auto(av)   f  18  27  p compact
## 4:         audi a4 quattro   2.0 2008   4 manual(m6)   4  20  28  p compact
## 5:         audi a4 quattro   2.0 2008   4   auto(s6)   4  19  27  p compact
## 6:         audi a4 quattro   3.1 2008   6   auto(s6)   4  17  25  p compact
# calculate the average mileage of highway and city for each single row
head(dt_mpg[, mileage_mean := (cty + hwy)/2])
##    manufacturer model displ year cyl      trans drv cty hwy fl   class
## 1:         audi    a4   1.8 1999   4   auto(l5)   f  18  29  p compact
## 2:         audi    a4   1.8 1999   4 manual(m5)   f  21  29  p compact
## 3:         audi    a4   2.0 2008   4 manual(m6)   f  20  31  p compact
## 4:         audi    a4   2.0 2008   4   auto(av)   f  21  30  p compact
## 5:         audi    a4   2.8 1999   6   auto(l5)   f  16  26  p compact
## 6:         audi    a4   2.8 1999   6 manual(m5)   f  18  26  p compact
##    mileage_mean
## 1:         23.5
## 2:         25.0
## 3:         25.5
## 4:         25.5
## 5:         21.0
## 6:         22.0
# or a fancy way
head(dt_mpg[, mileage_mean := rowMeans(.SD), .SDcols = c("cty", "hwy")])
##    manufacturer model displ year cyl      trans drv cty hwy fl   class
## 1:         audi    a4   1.8 1999   4   auto(l5)   f  18  29  p compact
## 2:         audi    a4   1.8 1999   4 manual(m5)   f  21  29  p compact
## 3:         audi    a4   2.0 2008   4 manual(m6)   f  20  31  p compact
## 4:         audi    a4   2.0 2008   4   auto(av)   f  21  30  p compact
## 5:         audi    a4   2.8 1999   6   auto(l5)   f  16  26  p compact
## 6:         audi    a4   2.8 1999   6 manual(m5)   f  18  26  p compact
##    mileage_mean
## 1:         23.5
## 2:         25.0
## 3:         25.5
## 4:         25.5
## 5:         21.0
## 6:         22.0
# calculate average highway mileage for all car models
dt_mpg[, mean(hwy)]
## [1] 23.44017
# But if we want to keep the result as data.table, we can use the .() operator
dt_mpg[, .(hwy_mean = mean(hwy))]
##    hwy_mean
## 1: 23.44017
# calculate average highway mileage for different classes
dt_mpg[, .(hwy_mean = mean(hwy)), by = class]
##         class hwy_mean
## 1:    compact 28.29787
## 2:    midsize 27.29268
## 3:        suv 18.12903
## 4:    2seater 24.80000
## 5:    minivan 22.36364
## 6:     pickup 16.87879
## 7: subcompact 28.14286
# order the average highway mileage of different classes
dt_mpg[, .(hwy_mean = mean(hwy)), by = class][order(hwy_mean)]
##         class hwy_mean
## 1:     pickup 16.87879
## 2:        suv 18.12903
## 3:    minivan 22.36364
## 4:    2seater 24.80000
## 5:    midsize 27.29268
## 6: subcompact 28.14286
## 7:    compact 28.29787
# order the average highway mileage of different classes (descending)
dt_mpg[, .(hwy_mean = mean(hwy)), by = class][order(-hwy_mean)]
##         class hwy_mean
## 1:    compact 28.29787
## 2: subcompact 28.14286
## 3:    midsize 27.29268
## 4:    2seater 24.80000
## 5:    minivan 22.36364
## 6:        suv 18.12903
## 7:     pickup 16.87879
# drop one column
dt_mpg[, year := NULL]

# drop several columns
dt_mpg[, c("drv", "cyl") := NULL]

dt_mpg
##      manufacturer  model displ      trans cty hwy fl   class mileage_mean
##   1:         audi     a4   1.8   auto(l5)  18  29  p compact         23.5
##   2:         audi     a4   1.8 manual(m5)  21  29  p compact         25.0
##   3:         audi     a4   2.0 manual(m6)  20  31  p compact         25.5
##   4:         audi     a4   2.0   auto(av)  21  30  p compact         25.5
##   5:         audi     a4   2.8   auto(l5)  16  26  p compact         21.0
##  ---                                                                     
## 230:   volkswagen passat   2.0   auto(s6)  19  28  p midsize         23.5
## 231:   volkswagen passat   2.0 manual(m6)  21  29  p midsize         25.0
## 232:   volkswagen passat   2.8   auto(l5)  16  26  p midsize         21.0
## 233:   volkswagen passat   2.8 manual(m5)  18  26  p midsize         22.0
## 234:   volkswagen passat   3.6   auto(s6)  17  26  p midsize         21.5

2 Data visualization with ggplot2

2.1 Structure of ggplot2

Every ggplot2 plot has three key components:

  • Data
  • A set of aesthetic mappings between variables in the data and visual properties
  • At least one layer which describes how to render each observation. Layers are usually created with a geom function.

2.2 Symbols, lines and colors

symbols

line types

color

Note that when using rgb() in R, 0-255 needs to be scaled to 0-1.

2.3 Dive into ggplot2

2.3.1 What we do here is just setting up a blank canvas

ggplot()

2.3.2 Add our data. Still a blank canvas as we don’t add any aesthetics

ggplot(data = mpg)

2.3.3 Add our aesthetics. Now we have x and y axes specified that correspond to in the aes() argument.

ggplot(data = mpg, aes(x=displ, y=cty))

2.3.4 Add our scatter points

ggplot(data = mpg, aes(x=displ, y=cty)) +
  geom_point()

2.3.5 Modify our symbols

ggplot(data = mpg, aes(x=displ, y=cty)) +
  geom_point(shape=21, size=3, fill="red", color="black")

2.3.6 Continue to modify our symbols

ggplot(data = mpg, aes(x=displ, y=cty)) +
  geom_point(shape=21, size=3, fill="red", color="black", stroke=0.1, alpha = 0.7)

2.3.7 Modify our theme

ggplot(data = mpg, aes(x=displ, y=cty)) +
  geom_point(shape=21, size=3, fill="red", color="black", stroke=0.1, alpha = 0.7) +
  theme_bw()

ggplot(data = mpg, aes(x=displ, y=cty)) +
  geom_point(shape=21, size=3, fill="red", color="black", stroke=0.1, alpha = 0.7) +
  theme_classic()

ggplot(data = mpg, aes(x=displ, y=cty)) +
  geom_point(shape=21, size=3, fill="red", color="black", stroke=0.1, alpha = 0.7) +
  theme_minimal()

2.3.8 Modify grids in the plot

ggplot(data = mpg, aes(x=displ, y=cty)) +
  geom_point(shape=21, size=3, fill="red", color="black", stroke=0.1, alpha = 0.7) +
  theme_bw() +
  theme(panel.grid.major = element_line(size = 0.5, linetype = "dashed"),
        panel.grid.minor = element_blank()
        )

2.3.9 Modify our axis titles

ggplot(data = mpg, aes(x=displ, y=cty)) +
  geom_point(shape=21, size=3, fill="red", color="black", stroke=0.1, alpha = 0.7) +
  labs(x="engine displacement in liters",
       y = "city mileage") +
  theme_bw() +
  theme(panel.grid.major = element_line(size = 0.5, linetype = "dashed"),
        panel.grid.minor = element_blank()
        )

2.3.10 Modify the font size

ggplot(data = mpg, aes(x=displ, y=cty)) +
  geom_point(shape=21, size=3, fill="red", color="black", stroke=0.1, alpha = 0.7) +
  labs(x="engine displacement in liters",
       y = "city mileage") +
  theme_bw() +
  theme(panel.grid.major = element_line(size = 0.5, linetype = "dashed"),
        panel.grid.minor = element_blank(),
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 14)
        )

2.3.11 Increase the margin between the axis title and the axis

ggplot(data = mpg, aes(x=displ, y=cty)) +
  geom_point(shape=21, size=3, fill="red", color="black", stroke=0.1, alpha = 0.7) +
  labs(x="engine displacement in liters",
       y = "city mileage") +
  theme_bw() +
  theme(panel.grid.major = element_line(size = 0.5, linetype = "dashed"),
        panel.grid.minor = element_blank(),
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 14),
        axis.title.x = element_text(margin = margin(t=10)),
        axis.title.y = element_text(margin = margin(r=10))
        )

2.3.12 Give different colors to the points based on different classes

ggplot(data = mpg, aes(x=displ, y=cty, fill=class)) +
  geom_point(shape=21, size=3, color="black", stroke=0.1, alpha = 0.7) +
  labs(x="engine displacement in liters",
       y = "city mileage") +
  theme_bw() +
  theme(panel.grid.major = element_line(size = 0.5, linetype = "dashed"),
        panel.grid.minor = element_blank(),
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 14),
        axis.title.x = element_text(margin = margin(t=10)),
        axis.title.y = element_text(margin = margin(r=10))
        )

2.3.13 Change the color palette using RColorBrewer

# use the "Accent" palette
ggplot(data = mpg, aes(x=displ, y=cty, fill=class)) +
  geom_point(shape=21, size=3, color="black", stroke=0.1, alpha = 0.7) +
  labs(x="engine displacement in liters",
       y = "city mileage") +
  scale_fill_brewer(palette = "Accent") +
  theme_bw() +
  theme(panel.grid.major = element_line(size = 0.5, linetype = "dashed"),
        panel.grid.minor = element_blank(),
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 14),
        axis.title.x = element_text(margin = margin(t=10)),
        axis.title.y = element_text(margin = margin(r=10))
        )

# use the "Dark2" palette
ggplot(data = mpg, aes(x=displ, y=cty, fill=class)) +
  geom_point(shape=21, size=3, color="black", stroke=0.1, alpha = 0.7) +
  labs(x="engine displacement in liters",
       y = "city mileage") +
  scale_fill_brewer(palette = "Dark2") +
  theme_bw() +
  theme(panel.grid.major = element_line(size = 0.5, linetype = "dashed"),
        panel.grid.minor = element_blank(),
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 14),
        axis.title.x = element_text(margin = margin(t=10)),
        axis.title.y = element_text(margin = margin(r=10))
        )

### Alternative to RColorBrewer palette

color_manual <- c("#001219", "#0a9396", "#94d2bd", "#e9d8a6", "#ee9b00", "#ca6702", "#9b2226")

# use show_col from the scales package to see your color
show_col(color_manual)

# use the "Accent" palette
ggplot(data = mpg, aes(x=displ, y=cty, fill=class)) +
  geom_point(shape=21, size=3, color="black", stroke=0.1, alpha = 0.7) +
  labs(x="engine displacement in liters",
       y = "city mileage") +
  scale_fill_manual(values=color_manual) +
  theme_bw() +
  theme(panel.grid.major = element_line(size = 0.5, linetype = "dashed"),
        panel.grid.minor = element_blank(),
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 14),
        axis.title.x = element_text(margin = margin(t=10)),
        axis.title.y = element_text(margin = margin(r=10))
        )

2.3.14 Modify the legend aesthetics

# use the "Accent" palette
ggplot(data = mpg, aes(x=displ, y=cty, fill=class)) +
  geom_point(shape=21, size=3, color="black", stroke=0.1, alpha = 0.7) +
  labs(x="engine displacement in liters",
       y = "city mileage") +
  scale_fill_brewer(palette = "Accent") +
  theme_bw() +
  theme(panel.grid.major = element_line(size = 0.5, linetype = "dashed"),
        panel.grid.minor = element_blank(),
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 14),
        axis.title.x = element_text(margin = margin(t=10)),
        axis.title.y = element_text(margin = margin(r=10)),
        legend.position = "bottom",
        legend.background = element_rect(color = "gray", size = 0.5),
        legend.title = element_text(size=12),
        legend.text = element_text(size=10)
        )

2.3.15 Continue to modify the legend

# use the "Accent" palette
ggplot(data = mpg, aes(x=displ, y=cty, fill=class)) +
  geom_point(shape=21, size=3, color="black", stroke=0.1, alpha = 0.7) +
  labs(x="engine displacement in liters",
       y = "city mileage") +
  scale_fill_brewer(palette = "Accent") +
  guides(fill = guide_legend(nrow = 1)) +
  theme_bw() +
  theme(panel.grid.major = element_line(size = 0.5, linetype = "dashed"),
        panel.grid.minor = element_blank(),
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 14),
        axis.title.x = element_text(margin = margin(t=10)),
        axis.title.y = element_text(margin = margin(r=10)),
        legend.position = "bottom",
        legend.background = element_rect(color = "gray", size = 0.5),
        legend.title = element_text(size=12),
        legend.text = element_text(size=10)
        )

2.3.16 Facets

# use the "Accent" palette
ggplot(data = mpg, aes(x=displ, y=cty)) +
  geom_point(shape=21, size=3, fill = "red", color="black", stroke=0.1, alpha = 0.7) +
  labs(x="engine displacement in liters",
       y = "city mileage") +
  facet_wrap(vars(class)) +
  theme_bw() +
  theme(panel.grid.major = element_line(size = 0.5, linetype = "dashed"),
        panel.grid.minor = element_blank(),
        axis.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        axis.title.x = element_text(margin = margin(t=10)),
        axis.title.y = element_text(margin = margin(r=10))
        )

2.3.17 Modify the size of panel title

# use the "Accent" palette
ggplot(data = mpg, aes(x=displ, y=cty)) +
  geom_point(shape=21, size=3, fill = "red", color="black", stroke=0.1, alpha = 0.7) +
  labs(x="engine displacement in liters",
       y = "city mileage") +
  facet_wrap(vars(class)) +
  theme_bw() +
  theme(panel.grid.major = element_line(size = 0.5, linetype = "dashed"),
        panel.grid.minor = element_blank(),
        axis.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        axis.title.x = element_text(margin = margin(t=10)),
        axis.title.y = element_text(margin = margin(r=10)),
        strip.text = element_text(size = 12)
        )

2.3.18 Add linear regression lines

# use the "Accent" palette
ggplot(data = mpg, aes(x=displ, y=cty)) +
  geom_point(shape=21, size=2, fill = "red", color="black", stroke=0.1, alpha = 0.7) +
  labs(x="engine displacement in liters",
       y = "city mileage") +
  geom_smooth(method = lm, se = TRUE, color="#999999", size = 0.5) +
  facet_wrap(vars(class)) +
  theme_bw() +
  theme(panel.grid.major = element_line(size = 0.5, linetype = "dashed"),
        panel.grid.minor = element_blank(),
        axis.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        axis.title.x = element_text(margin = margin(t=10)),
        axis.title.y = element_text(margin = margin(r=10)),
        strip.text = element_text(size = 12)
        )
## `geom_smooth()` using formula 'y ~ x'

2.3.19 Continuous color scale using viridis palette

ggplot(data = mpg, aes(x=displ, y=cty, fill=hwy)) +
  geom_point(shape=21, size=3, color="black", stroke=0.1, alpha = 0.7) +
  labs(x="engine displacement in liters",
       y = "city mileage") +
  scale_fill_viridis_c(option = "magma") +
  theme_bw() +
  theme(panel.grid.major = element_line(size = 0.5, linetype = "dashed"),
        panel.grid.minor = element_blank(),
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 14),
        axis.title.x = element_text(margin = margin(t=10)),
        axis.title.y = element_text(margin = margin(r=10))
        )

# change the legend aesthetics
range(mpg$hwy)
## [1] 12 44
ggplot(data = mpg, aes(x=displ, y=cty, fill=hwy)) +
  geom_point(shape=21, size=3, color="black", stroke=0.1, alpha = 0.7) +
  labs(x="engine displacement in liters",
       y = "city mileage") +
  scale_fill_continuous(type = "viridis",
                        option = "magma",
                        limits = c(10, 45), # limits are needed to make the labels work properly
                        breaks = seq(10, 45, by = 5),
                        labels = seq(10, 45, by = 5)) +
  theme_bw() +
  theme(panel.grid.major = element_line(size = 0.5, linetype = "dashed"),
        panel.grid.minor = element_blank(),
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 14),
        axis.title.x = element_text(margin = margin(t=10)),
        axis.title.y = element_text(margin = margin(r=10)),
        legend.title = element_text(size=14, vjust = 5),
        legend.text = element_text(size=12),
        legend.key.height = unit(15, units = "mm"),
        legend.key.width = unit(3, units = "mm")
        )

2.3.20 Different plot types

2.3.20.1 Barplot

# Bar plot: how many cars are in each class
ggplot(data = mpg) +
  geom_bar(aes(x=class)) +
  theme_bw() +
  theme(panel.grid.major = element_line(size = 0.5, linetype = "dashed"),
        panel.grid.minor = element_blank(),
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 14),
        axis.title.x = element_text(margin = margin(t=10)),
        axis.title.y = element_text(margin = margin(r=10)),
        )

# Bar plot: let's flip the x-y axis
ggplot(data = mpg) +
  geom_bar(aes(x=class)) +
  coord_flip() +
  theme_bw() +
  theme(panel.grid.major = element_line(size = 0.5, linetype = "dashed"),
        panel.grid.minor = element_blank(),
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 14),
        axis.title.x = element_text(margin = margin(t=10)),
        axis.title.y = element_text(margin = margin(r=10)),
        )

2.3.20.2 Boxplot

# Boxplot of hwy in each class

ggplot(data = mpg) +
  geom_boxplot(aes(x=class, y=hwy)) +
  theme_bw() +
  theme(panel.grid.major = element_line(size = 0.5, linetype = "dashed"),
        panel.grid.minor = element_blank(),
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 14),
        axis.title.x = element_text(margin = margin(t=10)),
        axis.title.y = element_text(margin = margin(r=10)),
        )

# Boxplot of hwy in each class and in each cylinder
ggplot(data = mpg) +
  geom_boxplot(aes(x=class, y=hwy, fill=as.factor(cyl)), size = 0.2) +
  coord_flip() +
  scale_fill_brewer(name="Number of cylinders", palette = "Accent") +
  theme_bw() +
  theme(panel.grid.major = element_line(size = 0.5, linetype = "dashed"),
        panel.grid.minor = element_blank(),
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 14),
        axis.title.x = element_text(margin = margin(t=10)),
        axis.title.y = element_text(margin = margin(r=10)),
        )

2.3.20.3 Histograms

# Histograms of hwy

ggplot(data = mpg) +
  geom_histogram(aes(x=hwy)) +
  theme_bw() +
  theme(panel.grid.major = element_line(size = 0.5, linetype = "dashed"),
        panel.grid.minor = element_blank(),
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 14),
        axis.title.x = element_text(margin = margin(t=10)),
        axis.title.y = element_text(margin = margin(r=10)),
        )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Histograms of hwy in each class
ggplot(data = mpg) +
  geom_histogram(aes(x=hwy, fill=class), bins = 20, position = "dodge") +
  scale_fill_brewer(palette = "Accent") +
  theme_bw() +
  theme(panel.grid.major = element_line(size = 0.5, linetype = "dashed"),
        panel.grid.minor = element_blank(),
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 14),
        axis.title.x = element_text(margin = margin(t=10)),
        axis.title.y = element_text(margin = margin(r=10)),
        )

# Density of hwy
ggplot(data = mpg) +
  geom_density(aes(x=hwy), size=0.2) +
  theme_bw() +
  theme(panel.grid.major = element_line(size = 0.5, linetype = "dashed"),
        panel.grid.minor = element_blank(),
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 14),
        axis.title.x = element_text(margin = margin(t=10)),
        axis.title.y = element_text(margin = margin(r=10)),
        )

# Density of hwy in each class

# we can clearly see the differences between small and big cars
ggplot(data = mpg) +
  geom_density(aes(x=hwy, fill=class), size=0.2, alpha = 0.5) +
  scale_fill_brewer(palette = "Accent") +
  theme_bw() +
  theme(panel.grid.major = element_line(size = 0.5, linetype = "dashed"),
        panel.grid.minor = element_blank(),
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 14),
        axis.title.x = element_text(margin = margin(t=10)),
        axis.title.y = element_text(margin = margin(r=10)),
        )

# OK, let's save our figure
ggsave(filename = "density_plot.pdf", width = 6, height = 4)

3 Geospatial data analysis

Notice

  • We can start with the basics of geospatial analysis, such as Geographic vs Projected coordinate reference systems, projection transformation, and raster vs. vector format, but let’s forget about all of these and start with a real example

3.1 GCOOS air temperature data

# learn more about time counting here: https://www.alexejgossmann.com/benchmarking_r/
# much faster using fread from data.table
system.time(dt_temp <- fread(file.path("data", "gcoos_2020_01_air_temperature.csv")))
##    user  system elapsed 
##   0.041   0.003   0.045
# slower when reading csv files using the built-in read.csv function
system.time(df_test <- read.csv(file.path("data", "gcoos_2020_01_air_temperature.csv")))
##    user  system elapsed 
##   0.278   0.009   0.287
# use rmarkdown to make the table pretty
rmarkdown::paged_table(dt_temp)
# check the data
str(dt_temp)
## Classes 'data.table' and 'data.frame':   175044 obs. of  6 variables:
##  $ network        : chr  "ADCP" "CenGOOS" "LUMCON" "LUMCON" ...
##  $ platform       : chr  "ioos:station:wmo:42395" "ioos:station:wmo:42067" "ioos:station:LUMCON:102" "ioos:station:LUMCON:wisl1" ...
##  $ latitude       : num  26.4 30 29.2 29.1 28.9 ...
##  $ longitude      : num  -90.8 -88.6 -90.6 -90.2 -89.4 ...
##  $ date           : POSIXct, format: "2020-01-01 00:00:00" "2020-01-01 00:00:00" ...
##  $ air_temperature: num  17.9 13.9 13.7 50.2 -9999 ...
##  - attr(*, ".internal.selfref")=<externalptr>
ggplot(dt_temp) +
  geom_histogram(aes(x=air_temperature))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).

quantile(dt_temp$air_temperature, prob = seq(0, 1, 0.01), na.rm = TRUE)
##         0%         1%         2%         3%         4%         5%         6% 
## -9999.0000 -9999.0000 -9999.0000 -9999.0000 -9999.0000 -9999.0000 -9999.0000 
##         7%         8%         9%        10%        11%        12%        13% 
## -9999.0000 -9999.0000     5.0000     6.7000     7.9000     8.8000     9.4000 
##        14%        15%        16%        17%        18%        19%        20% 
##     9.9000    10.4000    10.7000    11.1000    11.4000    11.6000    11.9000 
##        21%        22%        23%        24%        25%        26%        27% 
##    12.1000    12.4000    12.6000    12.8000    13.0000    13.2000    13.5000 
##        28%        29%        30%        31%        32%        33%        34% 
##    13.7000    13.9000    14.1000    14.3000    14.5644    14.7000    14.9000 
##        35%        36%        37%        38%        39%        40%        41% 
##    15.1000    15.3000    15.4000    15.6000    15.8000    15.9000    16.1000 
##        42%        43%        44%        45%        46%        47%        48% 
##    16.3000    16.4000    16.6000    16.7000    16.9000    17.1000    17.2000 
##        49%        50%        51%        52%        53%        54%        55% 
##    17.4000    17.6000    17.7000    17.9000    18.1000    18.2000    18.4000 
##        56%        57%        58%        59%        60%        61%        62% 
##    18.5000    18.7000    18.8000    19.0000    19.1000    19.2000    19.3000 
##        63%        64%        65%        66%        67%        68%        69% 
##    19.5000    19.6000    19.7000    19.9000    20.0000    20.1000    20.3000 
##        70%        71%        72%        73%        74%        75%        76% 
##    20.4000    20.6000    20.7000    20.9000    21.1000    21.3000    21.4000 
##        77%        78%        79%        80%        81%        82%        83% 
##    21.6000    21.8000    22.0000    22.2000    22.4000    22.6000    22.8000 
##        84%        85%        86%        87%        88%        89%        90% 
##    23.1000    23.3000    23.6000    23.8000    24.2000    24.5000    24.8000 
##        91%        92%        93%        94%        95%        96%        97% 
##    25.2000    25.6000    25.9000    26.3000    26.7000    27.2000    27.6000 
##        98%        99%       100% 
##    28.1000    73.3000    93.4000
# filter out the bad ones
dt_temp <- dt_temp[air_temperature > -20 & air_temperature < 40]

ggplot(dt_temp) +
  geom_histogram(aes(x=air_temperature))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# average air temperature by platform
dt_mean <- dt_temp[, 
                   .(temp = mean(air_temperature), 
                       network = first(network), 
                       lat = first(latitude), 
                       lon = first(longitude)),
                   by = platform]
dt_mean
##                                  platform     temp network     lat      lon
##   1:               ioos:station:wmo:42395 21.05588    ADCP 26.4040 -90.7920
##   2:               ioos:station:wmo:42067 14.63252 CenGOOS 30.0430 -88.6490
##   3:              ioos:station:LUMCON:102 15.20056  LUMCON 29.1870 -90.6093
##   4:         ioos:station:NOAA.NDBC:CDRF1 15.18620    NDBC 29.1360 -83.0290
##   5:         ioos:station:NOAA.NDBC:FWYF1 22.08396    NDBC 25.5910 -80.0970
##  ---                                                                       
## 113: ioos:station:NOAA.NOS.CO-OPS:8776139 17.41861     NOS 27.4800 -97.3220
## 114:            ioos:station:LUMCON:wisl1 37.98065  LUMCON 29.1144 -90.1840
## 115:            ioos:station:WAVCIS:CSI06 16.89686  WAVCIS 28.8667 -90.4833
## 116:               ioos:station:DISL:MBLA 12.52477    DISL 30.4367 -88.0117
## 117:           ioos:station:USF.COMPS:C10 17.33979   COMPS 27.1690 -82.9260
# let's plot our data

ggplot(data = dt_mean, aes(x=lon, y=lat, fill=temp)) +
  geom_point(shape=21, size=3, color="black", stroke=0.1, alpha = 0.7) +
  labs(x = "Longitude",
       y = "Latitude") +
  scale_fill_continuous(name = "Temperature (Celsius)",
                        type = "viridis",
                        option = "magma") +
  theme_bw() +
  theme(panel.grid.major = element_line(size = 0.5, linetype = "dashed"),
        panel.grid.minor = element_blank(),
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 14),
        axis.title.x = element_text(margin = margin(t=10)),
        axis.title.y = element_text(margin = margin(r=10)),
        legend.title = element_text(size=12, vjust = 5),
        legend.text = element_text(size=10),
        legend.key.height = unit(15, units = "mm"),
        legend.key.width = unit(3, units = "mm")
        )

Although we can plot our temperature as above, we are not plotting it in the geographic context!

3.2 Dive into geospatial analysis

3.2.1 Coordinate systems

Coordinate System is the most general term for a system that includes coordinates. It includes two common types:

A geographic coordinate system (GCS) is a reference framework that defines the locations of features on a model of the earth. It’s shaped like a globe—spherical. Its units are angular, usually degrees.

A projected coordinate system (PCS) is flat. It contains a GCS, but it converts that GCS into a flat surface, using math (the projection algorithm) and other parameters. Its units are linear, most commonly in meters.

3.2.1.1 Geographic coordinate system

3.2.1.2 Projected coordinate system

Learn more

  • US

3.2.2 Vector vs. Raster

There are two fundamental types of spatial objects: vectors and rasters. The main difference between Raster and Vector Data is that the raster data represents data as a cell or a grid matrix while vector data represents data using sequential points or vertices. Raster data can be made from vector data and vice versa depending on the goals of an analysis. Learn more

3.2.2.1 Vector

3.2.2.2 Raster

3.2.3 Revisit GCOOS air temperature

# convert data into a spatial vector
sf_temp <- st_as_sf(dt_mean, coords = c("lon", "lat"), crs = 4326)

# check the data
sf_temp
## Simple feature collection with 117 features and 3 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -97.4706 ymin: 17.002 xmax: -80.0333 ymax: 30.7083
## geographic CRS: WGS 84
## First 10 features:
##                        platform     temp network                geometry
## 1        ioos:station:wmo:42395 21.05588    ADCP  POINT (-90.792 26.404)
## 2        ioos:station:wmo:42067 14.63252 CenGOOS  POINT (-88.649 30.043)
## 3       ioos:station:LUMCON:102 15.20056  LUMCON POINT (-90.6093 29.187)
## 4  ioos:station:NOAA.NDBC:CDRF1 15.18620    NDBC  POINT (-83.029 29.136)
## 5  ioos:station:NOAA.NDBC:FWYF1 22.08396    NDBC  POINT (-80.097 25.591)
## 6  ioos:station:NOAA.NDBC:KTNF1 14.14388    NDBC  POINT (-83.593 29.819)
## 7  ioos:station:NOAA.NDBC:PTAT2 16.67026    NDBC  POINT (-97.051 27.826)
## 8  ioos:station:NOAA.NDBC:SANF1 22.45478    NDBC  POINT (-81.877 24.456)
## 9  ioos:station:NOAA.NDBC:SAUF1 15.75027    NDBC  POINT (-81.265 29.857)
## 10 ioos:station:NOAA.NDBC:SGOF1 15.93136    NDBC  POINT (-84.858 29.408)
# read in land outline
# we use st_read from sf package to read in vector file (here it is a .shp format)
sf_land <- st_read(file.path("data", "ne_50m_land", "ne_50m_land.shp"))
## Reading layer `ne_50m_land' from data source `/Users/shuang-zhang/Dropbox/Stuff/00-TAMU/REU/R_workshop/R_codes/data/ne_50m_land/ne_50m_land.shp' using driver `ESRI Shapefile'
## Simple feature collection with 1420 features and 3 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -180 ymin: -89.99893 xmax: 180 ymax: 83.59961
## geographic CRS: WGS 84
ggplot() +
  geom_sf(data=sf_land, fill="#dedede", size=0.1, color="#aaaaaa") +
  geom_sf(data=sf_temp, aes(fill=temp), shape=21, size=1, color="black", stroke=0.1, alpha = 0.7) +
  labs(x = "Longitude",
       y = "Latitude") +
  scale_fill_continuous(name = "Temperature",
                        type = "viridis",
                        option = "magma") +
  coord_sf(xlim = c(-180, 180), ylim = c(-90, 90), expand = FALSE) +
  theme_bw() +
  theme(panel.grid.major = element_line(size = 0.5, linetype = "dashed"),
        panel.grid.minor = element_blank(),
        axis.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        axis.title.x = element_text(margin = margin(t=10)),
        axis.title.y = element_text(margin = margin(r=10)),
        legend.title = element_text(size=10, vjust = 2),
        legend.text = element_text(size=8),
        legend.key.height = unit(10, units = "mm"),
        legend.key.width = unit(3, units = "mm")
        )

# change the projection to Mercator
sf_land_crop <- st_crop(sf_land, c(xmin = -180, ymin = -85, xmax = 180, ymax = 85))
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
sf_land_mercator <- st_transform(sf_land_crop, crs = 3857)

sf_temp_mercator <- st_transform(sf_temp, crs = 3857)

ggplot() +
  geom_sf(data=sf_land_mercator, fill="#dedede", size=0.1, color="#aaaaaa") +
  geom_sf(data=sf_temp_mercator, aes(fill=temp), shape=21, size=3, color="black", stroke=0.1, alpha = 0.7) +
  labs(x = "x",
       y = "y") +
  scale_fill_continuous(name = "Temperature",
                        type = "viridis",
                        option = "magma") +
  coord_sf(expand = FALSE, datum = 3857) +
  theme_bw() +
  theme(panel.grid.major = element_line(size = 0.5, linetype = "dashed"),
        panel.grid.minor = element_blank(),
        axis.title = element_text(size = 12),
        axis.text = element_text(size = 8),
        axis.title.x = element_text(margin = margin(t=10)),
        axis.title.y = element_text(margin = margin(r=10)),
        legend.title = element_text(size=10, vjust = 2),
        legend.text = element_text(size=8),
        legend.key.height = unit(10, units = "mm"),
        legend.key.width = unit(3, units = "mm")
        )

# crop the land to the Gulf of Mexico region
c_us_crop <- c(xmin = -105, ymin = 15, xmax = -75, ymax = 35)

sf_land_us <- st_crop(sf_land, c_us_crop)
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
ggplot() +
  geom_sf(data=sf_land_us, fill="#dedede", size=0.1, color="#aaaaaa") +
  geom_sf(data=sf_temp, aes(fill=temp), shape=21, size=3, color="black", stroke=0.1, alpha = 0.7) +
  labs(x = "Longitude",
       y = "Latitude") +
  scale_fill_continuous(name = "Temperature",
                        type = "viridis",
                        option = "magma") +
  coord_sf(expand = FALSE) +
  theme_bw() +
  theme(panel.grid.major = element_line(size = 0.5, linetype = "dashed"),
        panel.grid.minor = element_blank(),
        axis.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        axis.title.x = element_text(margin = margin(t=10)),
        axis.title.y = element_text(margin = margin(r=10)),
        legend.title = element_text(size=10, vjust = 2),
        legend.text = element_text(size=8),
        legend.key.height = unit(10, units = "mm"),
        legend.key.width = unit(3, units = "mm")
        )

3.2.4 Raster data manipulation

# Now we read in the global average sea surface temperature
# we use raster from raster package to read in raster file (here it is a .tif format)
rast_temp <- raster(file.path("data", "BO2_tempmean_ss_lonlat.tif"))

rast_temp
## class      : RasterLayer 
## dimensions : 2160, 4320, 9331200  (nrow, ncol, ncell)
## resolution : 0.08333333, 0.08333333  (x, y)
## extent     : -180, 180, -90, 90  (xmin, xmax, ymin, ymax)
## crs        : +proj=longlat +datum=WGS84 +no_defs 
## source     : /Users/shuang-zhang/Dropbox/Stuff/00-TAMU/REU/R_workshop/R_codes/data/BO2_tempmean_ss_lonlat.tif 
## names      : BO2_tempmean_ss_lonlat 
## values     : -1.797733, 30.17863  (min, max)
# plot the raster (will see how we can plot this using the ggspatial package coupled with ggplot2)
# the simple way
plot(rast_temp)

# check crs 
crs(rast_temp)
## CRS arguments: +proj=longlat +datum=WGS84 +no_defs
# check the resolution
res(rast_temp)
## [1] 0.08333333 0.08333333
# check the dimension
dim(rast_temp)
## [1] 2160 4320    1
# check the extent
extent(rast_temp)
## class      : Extent 
## xmin       : -180 
## xmax       : 180 
## ymin       : -90 
## ymax       : 90
# Calculate Raster Min and Max Values
minValue(rast_temp)
## [1] -1.797733
maxValue(rast_temp)
## [1] 30.17863
# get the pixel value
rast_temp[1, 1]
##           
## -1.716213
# calculate the mean temperature
cellStats(rast_temp, mean)
## [1] 13.82039
cellStats(rast_temp, sd)
## [1] 11.48417
sd(values(rast_temp), na.rm = TRUE)
## [1] 11.48417
# crop the raster
c_us_rast_crop <- c(xmin = -105, xmax = -75, ymin = 15, ymax = 35)

rast_temp_crop <- crop(rast_temp, y = c_us_rast_crop)

plot(rast_temp_crop)

# Here we use ggplot2 and ggspatial to plot the raster in the ggplot2 way
ggplot() +
  geom_sf(data=sf_land_us, fill="#dedede", size=0.1, color="#aaaaaa") +
  layer_spatial(data = rast_temp_crop, aes(fill = stat(band1))) +
  scale_fill_continuous(name = "Temperature",
                        type = "viridis",
                        option = "magma",
                        na.value = NA) +
  coord_sf(expand = FALSE) +
  theme_bw() +
  theme(panel.grid.major = element_line(size = 0.5, linetype = "dashed"),
        panel.grid.minor = element_blank(),
        axis.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        axis.title.x = element_text(margin = margin(t=10)),
        axis.title.y = element_text(margin = margin(r=10)),
        legend.title = element_text(size=10, vjust = 2),
        legend.text = element_text(size=8),
        legend.key.height = unit(10, units = "mm"),
        legend.key.width = unit(3, units = "mm")
        )
## Warning: Removed 40481 rows containing missing values (geom_raster).

3.2.5 Raster-Vector interactions

# read in the shapefile of marine regions from International Hydrographic Organization (https://www.marineregions.org/gazetteer.php?p=details&id=4288)


sf_iho <- st_read(file.path("data", "World_Seas_IHO_v1", "World_Seas.shp"))
## Reading layer `World_Seas' from data source `/Users/shuang-zhang/Dropbox/Stuff/00-TAMU/REU/R_workshop/R_codes/data/World_Seas_IHO_v1/World_Seas.shp' using driver `ESRI Shapefile'
## Simple feature collection with 101 features and 3 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -180 ymin: -85.47029 xmax: 180 ymax: 90
## geographic CRS: WGS 84
# extract the Gulf of Mexico
sf_gulf <- sf_iho[sf_iho$NAME == "Gulf of Mexico", ]

# check the extraction
ggplot() +
  geom_sf(data=sf_land_us, fill="#dedede", size=0.1, color="#aaaaaa") +
  geom_sf(data = sf_gulf, fill = "#9bf6ff")

# mask the seawater temperature with the shapefile of the Gulf of Mexico
rast_temp_crop
## class      : RasterLayer 
## dimensions : 240, 360, 86400  (nrow, ncol, ncell)
## resolution : 0.08333333, 0.08333333  (x, y)
## extent     : -105, -75, 15, 35  (xmin, xmax, ymin, ymax)
## crs        : +proj=longlat +datum=WGS84 +no_defs 
## source     : memory
## names      : BO2_tempmean_ss_lonlat 
## values     : 21.01431, 29.46556  (min, max)
rast_temp_gulf <- mask(rast_temp_crop, sf_gulf)

rast_temp_gulf
## class      : RasterLayer 
## dimensions : 240, 360, 86400  (nrow, ncol, ncell)
## resolution : 0.08333333, 0.08333333  (x, y)
## extent     : -105, -75, 15, 35  (xmin, xmax, ymin, ymax)
## crs        : +proj=longlat +datum=WGS84 +no_defs 
## source     : memory
## names      : BO2_tempmean_ss_lonlat 
## values     : 23.08181, 28.14442  (min, max)
plot(rast_temp_gulf)

# shrink the extent
rast_temp_gulf_trim <- trim(rast_temp_gulf)

plot(rast_temp_gulf_trim)

# plot the final product
ggplot() +
  geom_sf(data=sf_land_us, fill="#dedede", size=0.1, color="#aaaaaa") +
  layer_spatial(data = rast_temp_gulf_trim, aes(fill = stat(band1))) +
  scale_fill_continuous(name = "Temperature",
                        type = "viridis",
                        option = "magma",
                        na.value = NA) +
  coord_sf(expand = FALSE) +
  theme_bw() +
  theme(panel.grid.major = element_line(size = 0.5, linetype = "dashed"),
        panel.grid.minor = element_blank(),
        axis.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        axis.title.x = element_text(margin = margin(t=10)),
        axis.title.y = element_text(margin = margin(r=10)),
        legend.title = element_text(size=10, vjust = 2),
        legend.text = element_text(size=8),
        legend.key.height = unit(10, units = "mm"),
        legend.key.width = unit(3, units = "mm")
        )
## Warning: Removed 11577 rows containing missing values (geom_raster).

# extract the raster points to a data.frame so that you can use it later
df_temp_gulf <- rasterToPoints(rast_temp_gulf_trim)

dt_temp_gulf <- as.data.table(df_temp_gulf)

rmarkdown::paged_table(dt_temp_gulf)

3.2.6 Interactive maps

Notice

  • Let’s use the leaflet package to plot our map interactively
# define the color palette
pal <- colorNumeric(
    palette = "magma",
    domain = values(rast_temp_gulf_trim),
    na.color = NA
)

leaflet() %>% 
    addTiles() %>%
    addRasterImage(x = rast_temp_gulf_trim, colors = pal, opacity = 1) %>% 
    addLegend("bottomright", 
              pal = pal, 
              values = values(rast_temp_gulf_trim),
              title = "Temperature",
    )

4 Summary and beyond

  • Through this tutorial, you have learned the R basics, data visualization and geospatial analysis
  • You are encouraged to install the open-source QGIS for better data interaction
  • Some aspects are not covered in this course (such as the if else statement, the for loop, and string manipulation) partly due to the time limit and partly due to our focus on data analysis (instead of solving equations and analyzing texts). You can easily learn these stuff from online tutorials, such as here
  • There are also additional tutorials and cheatsheets in the R_workshop word document
  • Below are some good cheatsheets for your convenience